Skip to content

feat: add embedded Process Supervisor for unified process lifecycle#1370

Merged
thedotmack merged 10 commits intomainfrom
feature/process-supervisor
Mar 16, 2026
Merged

feat: add embedded Process Supervisor for unified process lifecycle#1370
thedotmack merged 10 commits intomainfrom
feature/process-supervisor

Conversation

@thedotmack
Copy link
Owner

Summary

Note: Unix domain socket transport was added and then reverted in the final commit — TCP on port 37777 remains the only transport since there's only ever one worker process.

Test plan

  • bun test tests/supervisor/ — 44/44 pass
  • bun test tests/infrastructure/ tests/hooks/ tests/utils/claude-md-utils.test.ts — 200/200 pass
  • curl http://localhost:37777/api/health — 200 OK after build-and-sync
  • Context injection works on session start
  • Verify /api/admin/doctor returns supervisor diagnostics
  • Verify session deletion reaps child processes

🤖 Generated with Claude Code

thedotmack and others added 7 commits March 15, 2026 03:10
…anagement

Consolidates scattered process management (ProcessManager, GracefulShutdown,
HealthMonitor, ProcessRegistry) into a unified src/supervisor/ module.

New: ProcessRegistry with JSON persistence, env sanitizer (strips CLAUDECODE_*
vars), graceful shutdown cascade (SIGTERM → 5s wait → SIGKILL with tree-kill
on Windows), PID file liveness validation, and singleton Supervisor API.

Fixes #1352 (worker inherits CLAUDECODE env causing nested sessions)
Fixes #1356 (zombie TCP socket after Windows reboot)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds reapSession(sessionId) to ProcessRegistry for killing session-tagged
processes on session end. SessionManager.deleteSession() now triggers reaping.
Tightens orphan reaper interval from 60s to 30s.

Fixes #1351 (MCP server processes leak on session end)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduces socket-manager.ts for UDS-based worker communication, eliminating
port 37777 collisions between concurrent sessions. Worker listens on
~/.claude-mem/sockets/worker.sock by default with TCP fallback.

All hook handlers, MCP server, health checks, and admin commands updated to
use socket-aware workerHttpRequest(). Backwards compatible — settings can
force TCP mode via CLAUDE_MEM_WORKER_TRANSPORT=tcp.

Fixes #1346 (port 37777 collision across concurrent sessions)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes the fallback path where hook scripts started WorkerService in-process,
making the worker a grandchild of Claude Code (killed by sandbox). Hooks now
always delegate to ensureWorkerStarted() which spawns a fully detached daemon.

Fixes #1249 (grandchild process killed by sandbox)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds 30-second periodic health sweep that prunes dead processes from the
supervisor registry and cleans stale socket files. Adds /api/admin/doctor
endpoint exposing supervisor state, process liveness, and environment health.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
64 tests covering all supervisor modules: process registry (18 tests),
env sanitizer (8), shutdown cascade (10), socket manager (15), health
checker (5), and supervisor API (6). Includes persistence, isolation,
edge cases, and cross-module integration scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The socket-manager introduced UDS as default transport, but this broke
the HTTP server's TCP accessibility (viewer UI, curl, external monitoring).
Since there's only ever one worker process handling all sessions, the
port collision rationale for UDS doesn't apply. Reverts to TCP-only,
removing ~900 lines of unnecessary complexity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude
Copy link

claude bot commented Mar 15, 2026

Code Review: feat: add embedded Process Supervisor for unified process lifecycle

Overall this is a solid, well-scoped PR. The supervisor architecture is clean, the test suite is comprehensive, and the env-sanitizer work is excellent. A few issues worth addressing before merge.


Critical

1. Race condition in signal handlers (src/supervisor/index.ts)

The shutdown handler is async void. If two signals arrive before stopPromise is set, both pass the if (this.stopPromise) guard and race to run shutdownHandler() concurrently:

// Both SIGTERM and SIGINT can hit here before stopPromise is assigned
if (this.stopPromise) {
  logger.warn('SYSTEM', `Received ${signal} but shutdown already in progress`);
  return;
}

Fix: Use a synchronous boolean flag set before the first await:

if (this.shutdownInitiated) return;
this.shutdownInitiated = true;
this.stopPromise = this.shutdownHandler(signal);
await this.stopPromise;

2. reapSession() uses busy-wait polling (src/supervisor/process-registry.ts)

50 iterations × 100ms = 5 seconds of polling per reap. There's also a redundant variable:

// Line ~3551: "remaining" is immediately computed from same data as "survivors"
const remaining = survivors.filter(r => isPidAlive(r.pid));

Consider replacing the poll loop with a brief sleep(500ms) + single recheck, or use wait4 if available. The redundant remaining variable should just reuse survivors.


High

3. Duplicate env variable lists (src/supervisor/env-sanitizer.ts vs /api/admin/doctor endpoint vs ChromaMcpManager)

The env prefix/exact-match lists are duplicated in at least two places. Export them as constants from env-sanitizer.ts and import where needed:

// env-sanitizer.ts
export const ENV_PREFIXES = ['CLAUDECODE_', 'CLAUDE_CODE_'];
export const ENV_EXACT_MATCHES = ['CLAUDECODE', 'CLAUDE_CODE_SESSION', ...];

4. /api/admin/doctor zombie detection returns boolean, not the dead PIDs

const zombiePidFiles = processes.some(p => p.status === 'dead');  // loses which PIDs

This makes the endpoint much less useful for debugging. Return the actual dead PID list:

const deadProcesses = processes.filter(p => p.status === 'dead').map(p => p.pid);

5. Duplicate SIGTERM → SIGKILL cascade logic

shutdown.ts and process-registry.reapSession() both implement the same "SIGTERM, wait, SIGKILL survivors" pattern independently. If one gets a bug fix (e.g., a timeout adjustment), the other won't. Consider having one call the other, or extracting shared logic into a signalWithFallback() helper.

6. Reap failure logged at DEBUG, not WARN

In SessionManager.ts:

} catch (error) {
  logger.debug('SESSION', 'Supervisor reapSession failed (non-blocking)', ...);
}

A process reaping failure during session cleanup is operationally significant. This should be at WARN level so it's visible without enabling DEBUG mode.


Medium

7. reapSession() session ID coercion silently masks bad input

const sessionIdNum = typeof sessionId === 'number' ? sessionId : Number(sessionId) || undefined;

Number("some-uuid") returns NaN, which falls back to undefined. Log entries will say sessionId: undefined with no indication why. Consider a short explicit guard or at least a WARN when coercion fails.

8. ChromaMcpManager casts through unknown to access _process

const chromaProcess = (this.transport as unknown as { _process?: ChildProcess })._process;

This is fragile against MCP SDK updates. The existing if (!chromaProcess?.pid) return; guard is correct for runtime safety, but worth tracking in a TODO comment in case the SDK exposes a cleaner API.

9. Hard-coded SIGTERM/SIGKILL timeouts

5s and 1s are reasonable defaults, but they can't be tuned without a code change. Consider pulling them from settings or at least defining them as named constants at the top of the file.


Tests

The 44-test suite is well-structured and the env-sanitizer tests in particular are excellent. A couple of gaps:

  • reapSession and shutdown tests use impossibly-high PIDs (2147483640 etc.) that are guaranteed dead. The actual SIGTERM/SIGKILL sending logic is never exercised. Consider one integration-style test that spawns a real sleep process and verifies it's killed.
  • Signal handler logic in index.ts is untested — the acknowledged singleton limitation means this code path is entirely covered by manual testing. Worth at least an integration test in CI.

Minor: tree-kill as optionalDependency

"optionalDependencies": {
  "tree-kill": "^1.2.2"
}

Using loadTreeKill() with a silent fallback to taskkill is acceptable. Ensure the fallback path has a timeout on the execFileAsync call so a hung taskkill doesn't block shutdown indefinitely.


Summary

Severity Count Must-fix before merge
Critical 2 Signal handler race, reapSession polling
High 4 Dup env lists, doctor endpoint info loss, dup cascade logic, log level
Medium 3 Session ID coercion, type cast, timeouts
Test gaps 2 Live-process reap test, signal handler test

The architecture itself is sound — session-scoped reaping, persistent registry, env sanitization, and the /doctor endpoint are all good additions. The critical signal handler race and the observable issues with the doctor endpoint are the most important to resolve.

🤖 Generated with Claude Code

Remove unused `acceptingSpawns` field from Supervisor class (written but
never read — assertCanSpawn uses stopPromise instead) and unused
`buildWorkerUrl` import from context handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude
Copy link

claude bot commented Mar 15, 2026

PR Review: feat: add embedded Process Supervisor for unified process lifecycle

The direction is correct — centralizing process lifecycle in src/supervisor/ is a clear improvement over the scattered approach. The 44-test suite is a solid start. Below are findings ordered by severity.


Bugs

BUG 1 — validateWorkerPidFile has a hardcoded path and is untestable
src/supervisor/index.ts hardcodes PID_FILE = path.join(homedir(), '.claude-mem', 'worker.pid') without accepting an optional path parameter (unlike runShutdownCascade which accepts pidFilePath). The 'stale' and 'invalid' branches in tests are never exercised as a result. Fix: add an optional pidFilePath parameter matching the pattern in runShutdownCascade.

BUG 2 — stopPromise self-clears, enabling re-entry after shutdown completes

this.stopPromise = runShutdownCascade(...).finally(() => {
  this.started = false;
  this.stopPromise = null;  // ← re-entry is now possible after resolve
});

After stop() resolves, a subsequent call will re-run the cascade. Safe on the current restart path (process exits), but could bite test teardown or future call sites.

BUG 3 — Silent failure when _process PID is unavailable after MCP connect
src/services/worker-service.ts and src/services/sync/ChromaMcpManager.ts both access (transport as unknown as { _process? })._process without logging when it's undefined. If the MCP SDK changes its internals or the process exits immediately, the supervisor has no record of that process and cannot kill it on shutdown. Add a logger.warn when !chromaProcess?.pid.

BUG 4 — Windows SIGTERM does not kill subprocess trees
src/supervisor/shutdown.ts signalProcess(): the tree-kill / taskkill /T path is only used for SIGKILL. SIGTERM on Windows maps to TerminateProcess (immediate kill of root only), so the SIGTERM pass in the shutdown cascade will always exhaust the full 5-second wait on Windows before SIGKILL fires, and subprocess trees will be orphaned.

BUG 5 — Map mutation during iteration in pruneDeadEntries

for (const [id, info] of this.entries) {
  if (isPidAlive(info.pid)) continue;
  this.entries.delete(id);  // mutation during iteration

The ES spec guarantees safety here, but it's a footgun for future readers. Collect IDs first, then delete in a second pass.


Race Conditions

RACE 1 — reapSession Phase 4 unregisters even when SIGKILL survivors remain
src/supervisor/process-registry.ts: after the SIGKILL wait loop, all session records are unconditionally unregistered regardless of whether isPidAlive still returns true. A D-state zombie will be silently dropped. Add a logger.warn when survivors remain post-SIGKILL.

RACE 2 — reapSession blocks up to 5s on already-dead processes
reapSession is called after ensureProcessExit which already sent SIGKILL to the tracked processes. The 5-second REAP_SESSION_SIGTERM_TIMEOUT_MS wait will fire even for processes that are already dead if they're still in the registry. A pruneDeadEntries() call before getBySession() in reapSession would prevent stale entries from triggering unnecessary waits.


Performance

PERF 1 — Synchronous writeFileSync on every register/unregister
src/supervisor/process-registry.ts persist() calls writeFileSync synchronously on every registration and deregistration. Under load (10 concurrent SDK agents), this is 20 blocking disk writes per round-trip. Consider a debounced async flush, with a synchronous flush only during shutdown.


Security

SEC 1 — /api/admin/doctor duplicates sanitizeEnv filter lists out of sync
src/services/server/Server.ts hardcodes envPrefixes and exactMatches arrays that are identical to those in src/supervisor/env-sanitizer.ts. When the sanitizer list is extended, the doctor endpoint's envClean check will be stale. Import and re-use sanitizeEnv instead of duplicating the list.

SEC 2 — isPidAlive returns true on EPERM — PID recycling on shared hosts
On a multi-user system, PID recycling could give a dead worker's PID to another user's process. EPERM from kill(pid, 0) will declare the new process "alive", causing startSupervisor() to throw "Worker already running". This is a known limitation of PID-file supervision — it should be documented in the JSDoc.


Error Handling

ERR 1 — workerHttpRequest hardcodes 127.0.0.1 while buildWorkerUrl uses getWorkerHost()
These should be consistent. If CLAUDE_MEM_WORKER_HOST is ever set to something other than 127.0.0.1, hooks will route to the right host but workerHttpRequest will not.

ERR 2 — Deprecated _port parameters are still accepted by callers
updateFolderClaudeMdFiles and updateCursorContextForProject have _port parameters that are now silently ignored. Callers still pass a port value. Either remove the parameter (breaking change) or add @deprecated JSDoc to prevent confusion.

ERR 3 — timeoutId non-null assertion is fragile
src/services/worker-service.ts:

let timeoutId: ReturnType<typeof setTimeout>;
// ...
clearTimeout(timeoutId!);  // non-null assertion

The Promise executor runs synchronously so this is correct today. Prefer let timeoutId: ReturnType<typeof setTimeout> | undefined with a null-safe clear for resilience.


/api/admin/doctor Endpoint

Good operational addition. Two quick notes:

  1. supervisor.running: true is hardcoded in the response — it's always true since the route only runs if the server is up. Expose getSupervisor().isStarted() (add a method) to make this field meaningful.
  2. The session-deletion reapSession failure is logged at debug level. Given this prevents process leaks, warn would be more appropriate for production visibility.

Test Coverage Gaps

Gap Impact
validateWorkerPidFile 'stale' and 'invalid' branches never hit Medium
reapSession never tested with live processes Medium
Supervisor.start() / stop() not tested in isolation (only integration) Medium
Windows taskkill / tree-kill paths untested Low
pruneDeadEntries double-kill scenario (SIGKILL survivor) Low

Summary

The core architecture is sound and the signal handling + shutdown cascade is well-structured. The most actionable items are:

  1. Fix the /api/admin/doctor env-check list duplication (SEC 1)
  2. Fix Windows SIGTERM not killing subprocess trees (BUG 4)
  3. Fix silent MCP/Chroma PID registration failures (BUG 3)
  4. Add pruneDeadEntries() call before reapSession to avoid 5s waits on already-dead processes (RACE 2)
  5. Consider async/debounced persist() writes (PERF 1)

🤖 Generated with Claude Code

@claude
Copy link

claude bot commented Mar 16, 2026

Code Review: feat: add embedded Process Supervisor for unified process lifecycle

Overall this is a well-structured PR. The separation of concerns across supervisor modules is clean, the graceful-degradation philosophy in hooks is correct, and the test suite for process-registry is thorough. A few bugs and design gaps worth addressing before merge.


Bugs

Medium: isPidAlive(0) returns true in process-registry.ts

PID 0 sends signals to the entire process group — it should never be a valid managed process. Returning true silently would prevent it from ever being pruned if a registration bug slips one in. At minimum, log a warning.

Medium: reapSession reports reaped: N before confirming processes are actually dead

logger.info('SYSTEM', `Reaped ${sessionRecords.length} process(es)...`, { reaped: sessionRecords.length });

This logs (and returns) the count of intended kills, not confirmed terminations. If SIGKILL-resistant processes survive, monitoring and callers get a misleading count.

Medium: dataDir in ShutdownCascadeOptions is accepted and passed in tests but never read

export interface ShutdownCascadeOptions {
  dataDir?: string;  // accepted everywhere, never used

Either wire it to the PID file path default or remove it before it confuses future contributors.

Low: stopPromise is nulled in .finally() before the outer await resolves

this.stopPromise = runShutdownCascade({...}).finally(() => {
  this.started = false;
  this.stopPromise = null;  // ← nulled before outer await sees it
});
await this.stopPromise;

A second concurrent stop() caller hitting just after .finally() runs would see null and start a new cascade. The window is tiny given the deduplication guard, but worth a comment or restructure (null after the outer await).

Low: All hook requests default to the 3-second health-check timeout

// worker-utils.ts
const timeoutMs = options.timeoutMs ?? HEALTH_CHECK_TIMEOUT_MS; // 3000ms

session-init, observation, etc. don't specify timeoutMs, so they silently fall back to 3s. On a slow system (DB lock, Chroma indexing), these will time out and drop data. Either raise the default for non-health calls or document that callers must specify timeoutMs explicitly.


Code Quality

zombiePidFiles is misnamed in /api/admin/doctor

const zombiePidFiles = processes.some(p => p.status === 'dead'); // checks registry, not PID file

This checks for dead entries in the process registry, not for a stale worker.pid file. The name is misleading. Also: the doctor endpoint never calls validateWorkerPidFile(), so a stale PID file after a crash won't be surfaced — which is exactly the kind of thing a /doctor endpoint should catch.

Env-check logic in Server.ts duplicates env-sanitizer.ts

// Server.ts lines 310-314 — verbatim copy of sanitizer logic
const envPrefixes = ['CLAUDECODE_', 'CLAUDE_CODE_'];
const exactMatches = [...];

If the sanitizer's list is updated, the doctor endpoint will silently report false "clean" state. Extract the key list to a shared constant.

buildWorkerUrl() exists but workerHttpRequest() rebuilds the URL inline

buildWorkerUrl(path) is exported from worker-utils.ts but not called by workerHttpRequest — the URL is reconstructed inline. Use the shared helper to keep URL construction in one place.

SIGHUP daemon detection reads raw argv

if (process.argv.includes('--daemon')) {  // fragile

If the flag name changes, SIGHUP suppression silently breaks. Consider reading this from the settings/config object.


Test Coverage Gaps

validateWorkerPidFile test in index.test.ts asserts nothing meaningful

it('returns "missing" when PID file does not exist', () => {
  const status = validateWorkerPidFile({ logAlive: false });
  expect(['missing', 'alive', 'stale', 'invalid']).toContain(status);
});

This accepts all four possible return values — it will pass regardless of what the function does and catches no regressions. The test should set up a temp directory with no PID file and assert specifically 'missing'.

Supervisor class cannot be unit-tested due to hardcoded singleton paths

startSupervisor/stopSupervisor bind to the real ~/.claude-mem dir. Compare to createProcessRegistry(path) which is correctly injectable. Consider accepting PID file path as a constructor param on Supervisor so tests can use temp dirs — the existing process-registry tests demonstrate this pattern working well.

health-checker.test.ts never verifies pruneDeadEntries() is called

The tests check setInterval lifecycle but don't assert the pruning callback actually fires. A spy on getProcessRegistry().pruneDeadEntries with a fake timer advancement would make this meaningful.

No test for SIGKILL escalation path in shutdown.test.ts

The "terminates children in reverse spawn order" test is good, but there's no test simulating a process that survives SIGTERM and requires the SIGKILL escalation. Worth adding given the two-phase logic.


Minor

  • getWorkerPort appears imported but unused in observation.ts and session-init.ts
  • tree-kill optional fallback to taskkill should log a warning when tree-kill is missing so Windows users know they're on the fallback path
  • The persist() call on every registry mutation (register, unregister, prune) could be debounced if bulk operations become common

Positive Notes

  • Clean module boundaries — each file has a focused single responsibility
  • Graceful degradation throughout (all hooks fail-open)
  • unref() on the health-check interval is correct
  • requireLocalhost middleware handles the IPv6-mapped ::ffff:127.0.0.1 case
  • loadTreeKill() optional-dep pattern with fallback is clean Windows handling
  • createProcessRegistry(path) factory enabling test isolation is the right design — apply same to Supervisor

🤖 Generated with Claude Code

@thedotmack
Copy link
Owner Author

Review Fixes Applied

Addressed the following issues from the three code reviews:

Must-Fix (7)

  1. Signal handler race condition (src/supervisor/index.ts) — Added synchronous shutdownInitiated boolean flag set before the first await, preventing two rapid signals from both entering the shutdown path.
  2. isPidAlive(0) returns true (src/supervisor/process-registry.ts) — PID 0 sends signals to the entire process group. Now returns false.
  3. Duplicate env variable lists — Exported ENV_PREFIXES and ENV_EXACT_MATCHES from env-sanitizer.ts, imported in Server.ts instead of duplicating.
  4. /api/admin/doctor returns boolean not PIDs — Changed zombiePidFiles: booleandeadProcessPids: number[] for actual diagnostic value.
  5. Reap failure logged at debug — Bumped to logger.warn in SessionManager.ts since process reaping failures are operationally significant.
  6. validateWorkerPidFile untestable — Added optional pidFilePath parameter (matches runShutdownCascade pattern).
  7. Unused getWorkerPort imports — Removed from observation.ts and session-init.ts.

Should-Fix (3)

  1. workerHttpRequest rebuilds URL inline — Now calls buildWorkerUrl() instead of duplicating URL construction.
  2. Unused dataDir in ShutdownCascadeOptions — Removed from interface, implementation, and test call sites.
  3. validateWorkerPidFile test asserts nothing — Rewrote with temp dirs, now tests all 4 return values (missing, invalid, stale, alive).

Test Results

  • 46/46 supervisor tests pass (2 new tests for validateWorkerPidFile edge cases)
  • No new TypeScript errors introduced

🤖 Generated with Claude Code

…tignore, harden supervisor

- Downgrade request/response HTTP logging from info to debug to reduce noise
- Remove unused getWorkerPort imports, use buildWorkerUrl helper
- Export ENV_PREFIXES/ENV_EXACT_MATCHES from env-sanitizer, reuse in Server.ts
- Fix isPidAlive(0) returning true (should be false)
- Add shutdownInitiated flag to prevent signal handler race condition
- Make validateWorkerPidFile testable with pidFilePath option
- Remove unused dataDir from ShutdownCascadeOptions
- Upgrade reapSession log from debug to warn
- Rename zombiePidFiles to deadProcessPids (returns actual PIDs)
- Clean up gitignore: remove duplicate datasets/, stale ~*/ and http*/ patterns
- Fix tests to use temp directories instead of relying on real PID file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@thedotmack thedotmack merged commit 80a8c90 into main Mar 16, 2026
2 checks passed
@claude
Copy link

claude bot commented Mar 16, 2026

Code Review: PR #1370 — feat: add embedded Process Supervisor

Overall this is a well-structured PR. The supervisor layer is cleanly scoped, the migration from in-process fallback to workerHttpRequest() is a correct mechanical change, and the test suite (44 tests) is comprehensive. A few issues deserve attention before merge.


Bugs

[Medium] Double-unregister of CHROMA_SUPERVISOR_ID

In ChromaMcpManager, there are three places that call unregisterProcess(CHROMA_SUPERVISOR_ID): the once('exit') listener on the MCP process, the transport.onclose handler, and explicitly in stop(). When graceful shutdown occurs (process exits naturally then stop() is called), unregisterProcess fires twice. Map.delete on a missing key is a no-op so it won't crash, but it triggers unnecessary persist() / writeFileSync() calls on each. Consider using a flag or simply removing the once('exit') auto-unregister when the supervisor handles teardown.

[Low] Dead import in user-message.ts

import { ensureWorkerRunning, getWorkerPort, workerHttpRequest } from '../../shared/worker-utils.js';

getWorkerPort is imported but never called after the refactor. Should be removed.

[Low] Supervisor.start() throwing path leaves health checker un-started

async start(): Promise<void> {
  if (this.started) return;
  this.registry.initialize();
  const pidStatus = validateWorkerPidFile({ logAlive: false });
  if (pidStatus === 'alive') {
    throw new Error('Worker already running'); // started remains false, healthChecker never called
  }
  this.started = true;
  startHealthChecker();
}

If start() throws, registry.initialize() has already been called and started is still false. A retry after the stale process exits would skip the health checker since the throw path never reaches startHealthChecker(). This code path has no test coverage.


Design Fragility

[Low] Private _process field access via double cast

const mcpProcess = (transport as unknown as { _process?: ChildProcess })._process;

This accesses a private implementation detail of StdioClientTransport from the MCP SDK. If the SDK renames or removes _process, this silently evaluates to undefined and the process is never registered with the supervisor (the if (mcpProcess?.pid) guard swallows it). The defensive approach is reasonable, but a comment noting the SDK version this was verified against would help future maintainers know when to re-examine it.


Minor Issues

_port parameter on updateCursorContextForProject — the parameter is shadowed with an underscore and silently ignored ("legacy, now resolved automatically"). It should be formally removed from the signature rather than renamed, so callers aren't confused about what to pass.

Cosmetic: indentation in shutdown.test.ts around line 108:

      await runShutdownCascade({
        registry,
        currentPid: process.pid,
          pidFilePath: path.join(tempDir, 'worker.pid')  // extra indent
      });

Cosmetic: .gitignore is missing a newline at end of file (the \No newline at end of file marker in the diff).


Test Coverage Notes

Coverage is solid overall. A few gaps worth noting:

  • No test for the Supervisor.start() throwing path described above.
  • reapSession tests use dead PIDs and test the "all already dead" path well, but the SIGTERM → wait → SIGKILL escalation path with a live process isn't covered (likely intentional for isolation).
  • workerHttpRequest has no direct unit tests; coverage is indirect through mock-based tests.

Performance Notes

  • persist() (via writeFileSync) is called synchronously on every register() / unregister(). Fine at current scale (< 10 managed processes), but worth a comment noting the sync write is intentional.
  • getAll() does an O(n log n) sort on every call; getByPid() calls it and then filters, where a direct Map lookup would suffice. Not a real concern at current scale but easy to improve later.
  • Orphan reaper interval halved from 60s → 30s (noted in commit but not the PR description).

Summary

Priority Issue
Medium Double-unregister in ChromaMcpManager causing extra disk writes
Low Dead import of getWorkerPort in user-message.ts
Low Supervisor.start() throw path skips health checker, no test coverage
Low _process private field cast — document SDK version or track as TODO
Minor Remove _port param from updateCursorContextForProject signature
Cosmetic Indentation in shutdown.test.ts, missing EOF newline in .gitignore

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant